Data preparation report

PreSA (2020-2023) and PostSA (2023)
__________________________________________________________________

Abstract

This report details the project's data preparation workflow. To ensure end-to-end reproducibility, the code included in this report operates directly on the raw data downloaded from Qualtrics©.

Data files

The raw data files are downloaded from Qualtrics© into a folder called data_raw with their default Qualtrics© names, which include the survey name plus the date and time of the download. The data are exported from Qualtrics© as SPSS .sav data files with the extra long labels option.

The Qualtrics© questionnaire was based on a design codeplan saved in an Excel .xlsx file, which is stored in a folder named study_design. The same folder also contains a spreadsheet with details about the survey participants who also participated in the follow-up qualitative interview phase of the data collection.

The data_raw and study_design folders contain the following files:

path                                                                         size    modified
data_raw/postSA_YSJ_2023_3+February+2024_21.40.sav                           7.29M   2024-02-03 21:41:12
data_raw/preSA_YSJ_2023_12+April+2024_12.08.sav                              23.44M  2024-04-12 12:08:56
data_raw/Study Abroad Expectations_September 11, 2023_17.06.sav              43.24M  2023-09-12 00:09:39
data_raw/Study+Abroad+Expectations+–+External_September+11,+2023_17.09.sav   23.82M  2023-09-12 00:09:19
study_design/MAXOUT-SA_Codeplan.xlsx                                         55.82K  2024-04-18 16:43:18
study_design/MAXOUT-SA_Interviewees.xlsx                                     12.83K  2024-04-25 13:21:27

Printed on 09 May 2024

The code below sets up functional links to these files in R:

#### File paths ----------------------------------------------------------------------------------

(datafiles    <- list.files("data_raw", pattern = "\\.sav"))          # List `.sav` files
(designfiles  <- list.files("study_design", pattern = "\\.xlsx"))     # List `.xlsx` files

# (interviewees <- list.files("data_qualitative", pattern = "\\.xlsx")) # List `.xlsx` files

## 2020 pre-SA
preSA20_ysj_path   <- file.path("data_raw", grep("Study Abroad Expectations", datafiles, value = TRUE))     # 2020 YSJ student data
preSA20_ext_path   <- file.path("data_raw", grep("External", datafiles, value = TRUE))                      # 2020 Non-YSJ student data

## 2023 pre-SA
preSA23_ysj_path   <- file.path("data_raw", grep("preSA_YSJ_2023", datafiles, value = TRUE))                # 2023 YSJ pre-SA data

## 2023 post-SA
postSA23_ysj_path  <- file.path("data_raw", grep("postSA_YSJ_2023", datafiles, value = TRUE))               # 2023 YSJ post-SA data

## Design
codeplan_path      <- file.path("study_design", grep("Codeplan", designfiles, value = TRUE))                # Excel survey codebook
interviewees_path  <- file.path("study_design", grep("Interviewees", designfiles, value = TRUE))            # Excel list of interviewees

Pre-SA datasets

Questionnaire/variable differences

The 2020 pilot data collection consists of a YSJ and an external dataset. The difference between the two questionnaires was a single item that asked YSJ respondents whether they would also be interested in participating in a qualitative interview study. Qualitative data was not collected from external respondents:

In YSJ but not in External data:
[1] "ysj_interview"
[1] "Accept to participate in an interview"

The position of the variable in the dataset is:
[1] 303
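The comparison printed above can be reproduced with a short sketch along the following lines (assuming the two imported data frames `preSA20_ysj` and `preSA20_ext` from the import section below are in memory):

```r
#### Sketch: compare YSJ and External questionnaires --------------------------------------

diff_vars <- setdiff(names(preSA20_ysj), names(preSA20_ext))   # variables in YSJ data only
diff_vars                                                      # variable name(s)
sjlabelled::get_label(preSA20_ysj[diff_vars])                  # their variable label(s)
which(names(preSA20_ysj) %in% diff_vars)                       # position(s) in the dataset
```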

The 2023 pre-SA survey had several differences compared to the 2020 questionnaire:

2020: In which academic year do you expect to go on a Study Abroad year?

  • 2020/2021
  • 2021/2022
  • 2022/2023
  • 2023/2024

2023: In which year do you expect to go on a Study Abroad year/semester?

  • 2nd year
  • 3rd year

2020: Who do you expect to socialize with most while on Study Abroad?

  1. “Mainly with people/colleagues from JP/KO”
  2. “Mainly with friends/colleagues from my UK university”
  3. “Mainly with other foreign students”
  4. “Mainly with students from English speaking countries”
  5. “I don’t know”

2023: Who do you expect to socialize with most while on Study Abroad?

  1. “Mainly with people/colleagues from JP/KO”
  2. “Mainly with friends/colleagues from my UK university”
  3. “Mainly with foreign students from the UK”
  4. “Mainly with foreign students from other English speaking countries”
  5. “Mainly with foreign students from non-English-speaking countries”
  6. “I don’t know”

Other differences:

  • The block of 16 questions on “imagined self” was not asked
  • The email question asked for “university email address” specifically

The 2023 versions of the sayr and expect_socialise variables were given the _23 suffix in their variable names (in the Codeplan document).

Data import

The code below imports into R the pre-SA data (.sav files), the variable information (names, labels) from the codeplan document (study_design/MAXOUT-SA_Codeplan.xlsx), and information about which respondents also participated in qualitative follow-up interviews (study_design/MAXOUT-SA_Interviewees.xlsx):

#### Import from raw ----------------------------------------------------------------------------------------

## Pre-SA Codeplan
codeplan_pre <- read_excel(codeplan_path, sheet = "preSAvars")

## Interviewees
interviewees <- read_excel(interviewees_path) |> data_select(c("Random_ID", "interviewed_preSA", "interviewed_postSA"))

## 2020 pre-SA YSJ
preSA20_ysj <- read_spss(preSA20_ysj_path)                                          # Import from spss
names(preSA20_ysj) <- codeplan_pre$varname_pre20                                    # Assign variable names
sjlabelled::set_label(preSA20_ysj) <- codeplan_pre$varlabel_pre20                   # Assign variable labels

## 2020 pre-SA External
preSA20_ext <- read_spss(preSA20_ext_path)
names(preSA20_ext) <- codeplan_pre$varname_pre20[-303]                              # Assign variable names removing YSJ-specific var
sjlabelled::set_label(preSA20_ext) <- codeplan_pre$varlabel_pre20[-303]             # Assign variable labels removing YSJ-specific var

## 2023 pre-SA YSJ
preSA23_ysj <- read_spss(preSA23_ysj_path)
names(preSA23_ysj) <- na.omit(codeplan_pre$varname_pre23)                           # Assign variable names
sjlabelled::set_label(preSA23_ysj) <- na.omit(codeplan_pre$varlabel_pre23)          # Assign variable labels

Data management variables

Before merging the datasets, we create an additional cohort column recording the academic year of the pre-SA data collection. The YSJ survey was kept open for several months, spanning the second semester of the 2019/2020 academic year and the first semester of the 2020/2021 AY, so the preSA20_ysj dataset contains responses from two student cohorts (2019/2020 and 2020/2021). The preSA20_ext dataset should contain responses only from the 2019/2020 cohort, because the outbreak of the Covid-19 pandemic interfered with the data collection: international travel, and with it Study Abroad years, was put on hold. However, there is one response submitted in March 2021, which will be removed from the dataset:

#### Create `cohort` column ------------------------------------------------------------------------------------

preSA20_ysj <- preSA20_ysj |> 
  mutate(cohort = case_when(StartDate < as.POSIXct("2020-09-01") ~ "2019/2020",
                            StartDate >= as.POSIXct("2020-09-01") ~ "2020/2021"))

preSA20_ext <- preSA20_ext |> 
  mutate(cohort = "2019/2020") |> 
  dplyr::filter(StartDate < as.POSIXct("2020-09-01"))  # remove response dating "2021-03-20 16:11:51"

preSA23_ysj <- preSA23_ysj |> 
  mutate(cohort = "2023/2024")

Merging the pre-SA datasets

The dataset merging the three sources should therefore have ncol(preSA20_ysj) + 3 = 312 variables.

The code below merges the datasets and checks its dimensions:

#### Merge all pre-SA datasets from 2020 and 2023 ---------------------------------------------------------------
preSA <- sjmisc::add_rows(preSA20_ysj,   # `sjmisc::add_rows` keeps the `label` attribute but drops other non-relevant attributes
                          preSA20_ext,    ## `dplyr::bind_rows` removes the variable label attributes
                          preSA23_ysj)    ## `datawizard::data_merge` keeps all SPSS-specific attributes (`display_width`, `format.spss`)

### Check dimensions of merged dataframe                                                                      
dim(preSA)
[1] 243 312
Number of columns as expected: TRUE
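The "as expected" message above can be produced with a simple comparison, for example (a sketch; the "+ 3" counts the variables present only in the 2023 data):

```r
#### Sketch: check that the merged data has the expected number of columns ----------------

expected_cols <- ncol(preSA20_ysj) + 3    # three variables present only in the 2023 data
cat("Number of columns as expected:", ncol(preSA) == expected_cols, "\n")
```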

Replacing piped text

The Qualtrics© questionnaire included piped text for Japanese and Korean language students, and these appear with non-human-readable characters in the variable and value labels, so we replace these characters with the phrase “JP/KO”:

#### Replace shortcodes for "Japanese" and "Korean" -----------------------------------------------------

## Get all value labels as list
labs <- sjlabelled::get_labels(preSA)

## Change all the values labels in all the variables in list
labs <- lapply(labs, function(x) str_replace_all(x,
                                                 '\\$[^\\}]*\\}',
                                                 'JP/KO'))

## Apply changed labels to dataset; keep labels as attribute (don't do `as_label(as.numeric)` beforehand)
preSA <- sjlabelled::set_labels(preSA, labels = labs, force.labels = TRUE)               

For example, looking at the A1_comjpko variable before and after the change:

Before: Speaks with JP/KO friends in JP/KO (A1_comjpko)

Value  Label            N    Raw %  Valid %  Cum. %
1      ${lm://Field/1}  38   15.64  100      100
<NA>   <NA>             205  84.36  <NA>     <NA>

After: Speaks with JP/KO friends in JP/KO (A1_comjpko)

Value  Label  N    Raw %  Valid %  Cum. %
1      JP/KO  38   15.64  100      100
<NA>   <NA>   205  84.36  <NA>     <NA>

Converting categorical variables

We convert the values of all labelled factor (categorical) variables to their labels, so that later we can manipulate the values directly as text.

#### Convert labelled factor variables ------------------------------

## This keeps the unused labels as well
preSA <- preSA |> 
    mutate(across(where(is.factor), sjlabelled::as_numeric),
           across(everything(), sjlabelled::as_label))

## This keeps only the labels of categories that had valid responses
# preSA_alt <- preSA |> 
#     mutate(across(where(is.factor), labels_to_levels))

Combining Japanese and Korean versions of variables

The survey questions were broken down by language studied (Japanese/Korean), and we have duplicate variables coding the same question (prefixed with “A1_” for Japanese and “A2_” for Korean). With the code below we combine these variables:

#### Unify variables split by language ---------------------------------------------------------

korean <- preSA |> 
  dplyr::filter(language == "Korean") |>
  select(!starts_with("A1")) |> 
  rename_with(stringr::str_replace,
              pattern = "A2_", replacement = "",
              matches("A2_"))


japanese <- preSA |> 
  dplyr::filter(language == "Japanese") |>
  select(!starts_with("A2")) |> 
  rename_with(stringr::str_replace,
              pattern = "A1_", replacement = "",
              matches("A1_"))

missing <- preSA |>
  dplyr::filter(is.na(language)) |>                   # 13 missing answers to language
  datawizard::remove_empty_columns()                  # remove all empty columns

preSA <- sjmisc::add_rows(japanese, korean, missing)                             

Removing incomplete responses

There were 13 responses with missing data on language. Since the language studied was a core compulsory-answer item, these 13 cases were also unfinished responses. Of the 243 responses in the pre-SA dataset, 183 were finished and submitted. We keep only finished cases:

#### Keep only completed and submitted responses ---------------------------

preSA <- preSA |>
  dplyr::filter(Finished == "True")

Removing duplicates

E-mail addresses were requested primarily for contacting students who opted in to a follow-up qualitative interview and/or future (post-SA) rounds of data collection, as well as for contacting the winner of the randomly drawn participation prize. Respondent e-mail addresses and IP addresses are also helpful for identifying data reliability issues such as duplicate responses (n.b. the IPAddress collected by Qualtrics© is “external”, so respondents connecting from the same network will share an IP). We find four e-mail addresses with duplicate responses and keep the earlier response in each case, since information in the later responses could be contaminated by practice effects from having already completed the survey. Incidentally, the earlier responses also have fewer missing answers (albeit marginally: a difference of one in two cases).

The code below checks duplicates and removes them (suppressing the output for reasons of anonymity):

#### Identify case duplicates (by email) -------------------------------------------------------

preSA |> 
  mutate(dupe = duplicated(preSA$email)) |> 
  janitor::get_dupes(email) |> 
  select(email, IPAddress, dupe_count, dupe, Random_ID, StartDate) |> 
  datawizard::data_to_wide(id_cols = c("email", "dupe_count"), names_from = "dupe", values_from = "Random_ID") |> 
  datawizard::data_rename(c("dupe_count", "FALSE", "TRUE"), c("No. of duplicate emails", "ID_first", "ID_second")) |> 
  select(!email) |> 
  tinytable::tt()

# or:
preSA |> 
  mutate(duplicate = duplicated(preSA$email)) |> 
  data_duplicated(select = "email") |> 
  dplyr::add_count(email, name = "count_duplicates") |> 
  data_select(c("email", "IPAddress", "count_duplicates", "duplicate", "Random_ID", "StartDate", "count_na"))

## Counting IPAddress is less useful due to shared addresses at campus/dormitory
preSA |> 
  janitor::get_dupes(IPAddress) |> 
  select(email, IPAddress, dupe_count, Random_ID, StartDate)

No. of duplicate emails  ID_first  ID_second
2                        9419      8436
2                        9514      8175
2                        95339     70145
2                        84953     44422
#### Will delete the later responses (incidentally, these also have fewer NAs) -----------------
`%not_in%` = Negate(`%in%`)

preSA <- preSA |> 
  data_filter(Random_ID %not_in% c("8436", "8175", "70145", "44422"))    # keeps original "rownames"; `rownames(preSA) <- NULL` to renumber

# or:  
# preSA <- preSA |> 
#   dplyr::filter(!Random_ID %in% c("8436", "8175", "70145", "44422"))   # renumbers "rownames"
# or:
# preSA_alt <- data_unique(preSA, select = "email", keep = "first")      ## Deletes all attributes!!

This leaves us with 179 responses/cases/rows.
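The result of the deduplication can be verified with a quick check (a sketch; NA emails are excluded from the comparison so they are not counted as duplicates of each other):

```r
#### Sketch: verify the deduplication -----------------------------------------------------

stopifnot(!any(duplicated(preSA$email[!is.na(preSA$email)])))   # no duplicate emails remain
nrow(preSA)                                                     # remaining cases
```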

We can also check whether any Random_ID numbers were allocated multiple times (unfortunately, Qualtrics© offers no way to fine-tune the randomisation of these numbers). We find that the Random_ID number 3591 was allocated twice. One instance belongs to an external participant, so we make it unique by appending two zeros to it.

#### Identify identical Random_IDs --------------------------------------------------------
preSA |> 
  janitor::get_dupes(Random_ID) |> 
  select(Random_ID, cohort, uni, language) |> 
  tinytable::tt()
Random_ID  cohort     uni                      language
3591       2020/2021  York St John University  Japanese
3591       2019/2020  Cardiff University       Japanese
#### Fix identical Random_IDs -------------------------------------------------------------
preSA$Random_ID[preSA$uni == "Cardiff University" & preSA$Random_ID == "3591" ] <- "359100"

Variable selection

Export labels

Before removing unnecessary variables from the dataset, we can export all the variable and value labels so that we can use them in future survey designs. We use the 2023/2024 cohort data for this purpose, as that will serve as the basis for future data collections (code folded).

#### Export label lists ---------------------------------------------------------------------

### Extract variable names and labels
varlabs <- preSA |> 
  dplyr::filter(cohort=="2023/2024") |> 
  sjlabelled::get_label() |> 
  as.data.frame() |> 
  rownames_as_column() |> 
  rename(name = 1, labels = 2)

### Extract value labels
values <- preSA |> 
  dplyr::filter(cohort=="2023/2024") |> 
  recode_values(select = c(IPAddress:UserLanguage), 
                recode = list("Qualtrics_auto" = "min:max"), default = "Qualtrics_auto") |> 
  recode_values(select = c(ends_with("_txt"), postcode, email, Random_ID),
                recode = list("open-ended-text" = "min:max"), default = "open-ended-text") |> 
  recode_values(select = c(age, finishedschool),
                recode = list("dropdown-options" = "min:max"), default = "dropdown-options") |>
  sjlabelled::get_labels() |> 
  enframe() |> 
  tidyr::unnest_wider(value, names_sep = ",")

preSA_survey_labels <- left_join(varlabs, values, by = "name") 
rm(varlabs, values)

### Remove unwanted entries
preSA_survey_labels <- preSA_survey_labels |> column_to_rownames("name")
preSA_survey_labels["sayr" , ] <- preSA_survey_labels["sayr_23" , ]
preSA_survey_labels["expect_socialise" , ] <- preSA_survey_labels["expect_socialise_23" , ]
preSA_survey_labels <- preSA_survey_labels |> rownames_to_column("name")
preSA_survey_labels <- preSA_survey_labels[-c(186:194) , ]

### Export variable and value label list
data_write(preSA_survey_labels, "study_design/preSA_survey_labels.csv", na = "")  

Select out sensitive data

We can select out variables that contain more sensitive information to store separately from the main analysis dataset:

### Select out meta- and safeguarded variables ---------------------------------------------

preSA_sensitive <- preSA |> 
  select(Status:Progress, Finished:UserLanguage, postcode, followup, ysj_interview, email, Random_ID) 

Select out textual data

We can also select out variables that contain text entered in open-ended survey questions. These were identified with the _txt suffix in the Codeplan. We create and select the most useful variables to keep in the textual dataset:

### Get names of textual variables -----------------------------------------------------------------------------------------------

text_variables <- str_subset(names(preSA), pattern = "_txt")

### Select and create variables to keep in the textual dataset -------------------------------------------------------------------

preSA_textual <- preSA |>
  # create text variable concatenating all school types attended
  data_unite(select = contains("school_"), new_column = "schools_combined", remove_na = TRUE, append = TRUE, separator = ", ") |> 
  # `data_unite()` does not exclude NAs as expected (apparent bug), so we remove the "NA" strings manually
  data_modify(.at = "schools_combined", .modify = function(x) {text_remove(x, ", NA")}) |> 
  data_modify(.at = "schools_combined", .modify = function(x) {text_remove(x, "NA, ")}) |> 
  # select variables to keep
  data_select(c(Random_ID, uni, cohort, language, sayr, sayr_23, 
                gender, age, intstudnt, bornuk, pargrad, schools_combined,
                text_variables))

Recode textual data

It is more useful to keep a numeric version of the textual variables, recording the number of words in each answer, rather than the answers themselves. To distinguish these from the original variables, we add the suffix _nwords to their names:

### Function to count all "word" characters, first converting empty strings to NA
wordcounts <- function(x) {
  label <- get_label(x)               # save var labels
  x |> convert_to_na(na = "") |>      # convert to NA to avoid 0 values
       str_count('\\w+') |>           # count all "words"
  set_label(label)                    # reassign the saved labels
  }

#### Recode textual variables to wordcount numeric variables; add suffix to var name -----
preSA <- preSA |> 
         data_modify(.at = text_variables, .modify = wordcounts) |> 
         data_addsuffix(pattern = "_nwords", select = text_variables)

Insert information on interviewees

We add two additional variables recording whether the respondent has also participated in a qualitative in-depth interview at the pre-SA and post-SA stage:

preSA <- data_merge(preSA, interviewees, id = "Random_ID")
preSA_textual <- data_merge(preSA_textual, interviewees, id = "Random_ID")

Remove, add and relocate variables

Final variable preparation tasks. We exclude the sensitive data from the dataset, as well as the PIS variable and topic header variables; we add a variable counting the number of missing answers for each respondent; and we reorder variables:

### Remove, add and relocate variables ---------------------------------------------------------

preSA <- preSA |> 
  # remove variables
  select(!c(StartDate, EndDate, Status, IPAddress, Progress, RecordedDate:UserLanguage, 
            postcode, email, followup, ysj_interview,    
            Finished, pis,  
            contains("Topics"))) |>   
  # add count of missing answers
  rowwise() |> 
    mutate(missing_answers = sum(is.na(across(everything())))) |> 
  ungroup() |> 
  # relocate
  relocate(Random_ID, Duration, missing_answers, interviewed_preSA, interviewed_postSA, uni, cohort) |> 
  relocate(sayr_23, .after = sayr) |> 
  relocate(expect_socialise_23, .after = expect_socialise)

Analysis dataset check

The final analysis dataset contains 179 cases/rows and 172 variables/columns. 143 responses are from York St John University students. There are 2 variables containing only NA values: (proglength_txt_nwords, sib4occ_study_txt_nwords). The minimum number of missing answers across the dataset is 26 and the maximum is 62, with a median of 40:
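The headline figures reported above can be reproduced with checks along these lines (a sketch against the final preSA dataset):

```r
#### Sketch: reproduce the analysis dataset checks ----------------------------------------

dim(preSA)                                                  # cases and variables
sum(preSA$uni == "York St John University", na.rm = TRUE)   # responses from YSJ students
names(preSA)[sapply(preSA, function(x) all(is.na(x)))]      # variables containing only NAs
summary(preSA$missing_answers)                              # min/median/max missing answers
```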

data_tabulate(preSA, missing_answers, include_na = FALSE) |> print_html()
missing_answers (missing_answers) (integer)
Value N Raw % Valid % Cumulative %
26 1 0.56 0.56 0.56
30 2 1.12 1.12 1.68
31 4 2.23 2.23 3.91
32 2 1.12 1.12 5.03
33 4 2.23 2.23 7.26
34 8 4.47 4.47 11.73
35 11 6.15 6.15 17.88
36 11 6.15 6.15 24.02
37 13 7.26 7.26 31.28
38 15 8.38 8.38 39.66
39 14 7.82 7.82 47.49
40 18 10.06 10.06 57.54
41 9 5.03 5.03 62.57
42 5 2.79 2.79 65.36
43 10 5.59 5.59 70.95
44 4 2.23 2.23 73.18
45 1 0.56 0.56 73.74
46 4 2.23 2.23 75.98
47 1 0.56 0.56 76.54
48 5 2.79 2.79 79.33
49 4 2.23 2.23 81.56
50 1 0.56 0.56 82.12
51 4 2.23 2.23 84.36
52 1 0.56 0.56 84.92
53 1 0.56 0.56 85.47
54 6 3.35 3.35 88.83
55 5 2.79 2.79 91.62
56 6 3.35 3.35 94.97
57 1 0.56 0.56 95.53
58 4 2.23 2.23 97.77
59 1 0.56 0.56 98.32
60 2 1.12 1.12 99.44
62 1 0.56 0.56 100.00
total N=179 valid N=179

Data export

We export the analysis dataset with the name preSA2023 to SPSS .sav format in a new folder data_in. We export the sensitive data to a new folder data_lock. We also export the textual data to an Excel sheet and include variable names, labels and their concatenated version as additional rows to make them more informative as column headers in Excel:

## Export datasets ---------------------------------------------------------------------------------

fs::dir_create("data_in")
datawizard::data_write(preSA, "data_in/preSA2023.sav")

fs::dir_create("data_lock")
datawizard::data_write(preSA_sensitive, "data_lock/preSA2023_sensitive.sav")

## Export qualitative dataset ----------------------------------------------------------------------


# For this we use a function I wrote that modifies the behaviour of datawizard::data_write() to allow variable labels to be saved as
# the first row in the exported text file. This is achieved with an additional optional setting `labels_to_row = TRUE`. 
# If `labels_to_row` is not specified, the function does the same as datawizard::data_write()

# Import the function from GitHub Gist

devtools::source_gist("https://gist.github.com/CGMoreh/a706954fb56cf8cc4a1ddc53ac1a4737", filename = "my_data_write.R")

data_write(preSA_textual, "data_in/preSA2023_textual.xlsx", labels_to_row = TRUE)

Post-SA datasets

The aim of the post-SA data collection was to capture how opinions changed following the Study Abroad year. Many survey questions were repeated (almost) as first asked in the pre-SA data collection, and a smaller number of new items were introduced. Variables with an equivalent in the pre-SA data received the same name stub as in the pre-SA dataset, followed by the _post suffix. New variables without a pre-SA equivalent received the post_ prefix. A number of variables that were programmatically included in the Qualtrics© survey have the same name as in the pre-SA dataset; these will be excluded or modified, keeping only the shared variables needed for matching and merging the responses from the same individuals.
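Once the post-SA data are imported (next section), the naming convention can be inspected with a sketch like the following:

```r
#### Sketch: inspect the post-SA naming convention (assumes postSA23_ysj is imported) -----

repeated_vars <- grep("_post$", names(postSA23_ysj), value = TRUE)   # repeated pre-SA items
new_vars      <- grep("^post_", names(postSA23_ysj), value = TRUE)   # new post-SA items
shared_vars   <- intersect(names(postSA23_ysj), names(preSA))        # shared vars (for matching/merging)
```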

Data import

The code below imports the post-SA data (.sav files) and the variable information (names, labels) from the Codeplan spreadsheet (study_design/MAXOUT-SA_Codeplan.xlsx) into R:

#### Import from raw -----------------------------------------------------------------------------------

## Post-SA codeplan
codeplan_post <- read_excel(codeplan_path, sheet = "postSAvars")

## 2023 post-SA YSJ
postSA23_ysj <- read_spss(postSA23_ysj_path)
names(postSA23_ysj) <- na.omit(codeplan_post$varname_post23)                    # Assign variable names
sjlabelled::set_label(postSA23_ysj) <- na.omit(codeplan_post$varlabel_post23)   # Assign variable labels

The raw postSA23_ysj dataset has 34 responses and 117 variables.

Data cleansing

We fix Qualtrics© shortcodes in labels and convert categorical variable types. We keep the 32 completed responses only. We also remove several variables that were reused/pre-filled automatically from the pre-SA survey, keeping only the pre-filled Random_ID variable for merging. One of the pre-filled Random_IDs corresponds to one of the duplicates that were deleted from the pre-SA dataset. We replace this ID with that of the case kept in the analysis for the purpose of merging. We remove sensitive data.

#### Replace shortcodes for "Japanese" and "Korean" -----------------------------------------------------------

labs <- sjlabelled::get_labels(postSA23_ysj)
labs <- lapply(labs, function(x) str_replace_all(x,                             
                               '\\$[^\\}]*\\}', 
                               "JP/KO"))

postSA23_ysj <- postSA23_ysj |> 
  sjlabelled::set_labels(labels = labs, force.labels = TRUE)  

#### Export label lists ---------------------------------------------------------------------------------------

### Extract variable names and labels
varlabs <- postSA23_ysj |> 
  sjlabelled::get_label() |> 
  as.data.frame() |> 
  rownames_as_column() |> 
  rename(name = 1, labels = 2)

### Extract value labels
values <- postSA23_ysj |> 
  recode_values(select = c(IPAddress_post:UserLanguage_post), 
                recode = list("Qualtrics-auto" = "min:max"), default = "Qualtrics-auto") |> 
  recode_values(select = c(contains("_txt"), pre_course, course, post_email_uni, post_email_personal, Random_ID),
                recode = list("open-ended-text" = "min:max"), default = "open-ended-text") |> 
  # recode_values(select = c(age, finishedschool),
  #               recode = list("dropdown-options" = "min:max"), default = "dropdown-options") |>
  sjlabelled::get_labels() |> 
  enframe() |> 
  tidyr::unnest_wider(value, names_sep = ",")

postSA_survey_labels <- left_join(varlabs, values, by = "name") 
rm(varlabs, values)

### Export variable and value label list------------------------------------------------------------------------
data_write(postSA_survey_labels, "study_design/postSA_survey_labels.csv", na = "")



#### Extract sensitive data ------------------------------------------------------------------------------------
postSA_sensitive <- postSA23_ysj |>
  dplyr::select(c(post_qual_1, post_qual_2,                                          # PIS and eligibility vars
                  post_email_uni, post_aftergrad_followup, post_email_personal,      # Sensitive
                  Random_ID))


#### Select cases and variables --------------------------------------------------------------------------------
postSA <- postSA23_ysj |> 
  ## Convert labelled factor variables
  mutate(across(where(is.factor), sjlabelled::as_numeric),
         across(everything(), sjlabelled::as_label)) |> 
  ## Select valid responses
  dplyr::filter(Finished_post == "True" &
                post_qual_1 == "I agree to take part" &
                post_qual_2 == "I have done a full year of study abroad") |> 
  ## Select useful variables
  dplyr::select(!c(StartDate_post:Progress_post, Finished_post:UserLanguage_post,    # Qualtrics variables
                  uni_post, pre_course, course:SAcountry,                            # Prefilled from Pre-SA survey
                  post_qual_1, post_qual_2,                                          # PIS and eligibility vars
                  post_email_uni, post_aftergrad_followup, post_email_personal)) |>  # Sensitive
  # add count of missing answers
  rowwise() |> 
    mutate(missing_answers_post = sum(is.na(across(everything())))) |> 
  ungroup() |> 
  data_relocate(missing_answers_post)

#### Replace a `Random_ID`
postSA$Random_ID[postSA$Random_ID == "8436" ] <- "9419"

Merging pre-SA and post-SA data

We merge the postSA dataset with the responses from the same individuals in the preSA dataset:

postSA <- data_merge(preSA, postSA, join = "inner", by = "Random_ID")

Data export

We export the analysis dataset with the name postSA2023 to SPSS .sav format to the folder data_in, and the sensitive data to the folder data_lock:

## Export datasets ------------------------------------------------------------------------------

fs::dir_create("data_in")
datawizard::data_write(postSA, "data_in/postSA2023.sav")

fs::dir_create("data_lock")
datawizard::data_write(postSA_sensitive, "data_lock/postSA2023_sensitive.sav")

Export data files

Finally, we copy the non-sensitive analysis datasets over to a folder shared by the research team. These datasets can form the basis of replication datasets to be made publicly available on data repositories to supplement publications from the project. The analysis datasets also contain textual responses to open-ended survey questions, which should be removed from any public datasets, or reviewed for sensitive information, before they are made publicly available.

shared_path <- "D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey"
fs::dir_copy("data_in", path(shared_path, "data_in"), overwrite = TRUE)
fs::dir_copy("study_design", path(shared_path, "study_design"), overwrite = TRUE)

fs::dir_info(c(fs::path(shared_path, "data_in"), 
               fs::path(shared_path, "study_design")),
             )[, c(1,3,5)] |> 
  dplyr::rename(modified = modification_time) |> 
  flextable::qflextable() |> 
  flextable::fontsize(size = 10) |> flextable::width(1, 7) |> 
  flextable::add_footer_lines(paste("Printed on", format(Sys.Date(), "%d %B %Y")))

path  size  modified
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/data_in/postSA2023.sav  125.4K  2024-05-09 21:46:00
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/data_in/preSA2023.sav  73.9K  2024-05-09 21:45:58
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/data_in/preSA2023_textual.xlsx  96.4K  2024-05-09 21:46:00
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/study_design/MAXOUT-SA_Codeplan.xlsx  55.8K  2024-04-18 16:43:18
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/study_design/MAXOUT-SA_Interviewees.xlsx  12.8K  2024-04-25 13:21:27
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/study_design/postSA_survey_labels.csv  11.8K  2024-05-09 21:46:00
D:/York St John University/Chisato Danjo - #SA project (shared with project members)/# SURVEY ANALYSIS (QUAL_QUANT)/Survey/study_design/preSA_survey_labels.csv  20.9K  2024-05-09 21:45:57

Printed on 09 May 2024

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.3 (2024-02-29 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United Kingdom.utf8
 ctype    English_United Kingdom.utf8
 tz       Europe/London
 date     2024-05-09
 pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package    * version date (UTC) lib source
 conflicted * 1.2.0   2023-02-01 [1] CRAN (R 4.3.1)
 datawizard * 0.10.0  2024-03-26 [1] CRAN (R 4.3.3)
 devtools   * 2.4.5   2022-10-11 [1] CRAN (R 4.3.1)
 dplyr      * 1.1.4   2023-11-17 [1] CRAN (R 4.3.3)
 flextable  * 0.9.5   2024-03-06 [1] CRAN (R 4.3.3)
 fs         * 1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
 gt         * 0.10.1  2024-01-17 [1] CRAN (R 4.3.3)
 insight    * 0.19.10 2024-03-22 [1] CRAN (R 4.3.3)
 janitor    * 2.2.0   2023-02-02 [1] CRAN (R 4.3.1)
 librarian  * 1.8.1   2021-07-12 [1] CRAN (R 4.3.3)
 readxl     * 1.4.3   2023-07-06 [1] CRAN (R 4.3.1)
 sjlabelled * 1.2.0   2022-04-10 [1] CRAN (R 4.3.1)
 sjmisc     * 2.8.9   2021-12-03 [1] CRAN (R 4.3.1)
 stringr    * 1.5.1   2023-11-14 [1] CRAN (R 4.3.3)
 tibble     * 3.2.1   2023-03-20 [1] CRAN (R 4.3.3)
 tidyr      * 1.3.1   2024-01-24 [1] CRAN (R 4.3.3)
 usethis    * 2.2.3   2024-02-19 [1] CRAN (R 4.3.3)

 [1] C:/Users/cgmoreh/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.3/library

──────────────────────────────────────────────────────────────────────────────